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ABSTRACT 

We consider the partitioning of a problem on a domain with unequal 
work estimates in different subdomains in a way that balances the work load 
across multiple processors. Such a problem arises for example in solving par- 
tial differential equations using an adaptive method that places extra grid 
points in certain subregions of the domain. We use a binary decomposition 
of the domain to partition it into rectangles requiring equal computational 
effort. We then study the communication costs of mapping this partitioning 
onto different multiprocessors: a mesh-connected array, a tree machine and a 
hypercube. The communication cost expressions can be used to determine the 
optimal depth of the above partitioning. 
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1. Introduction 

We consider the partitioning of a problem on a domain with unequal computational 
work estimates in different subdomains, in a way that balances the work load across multiple 
processors. Such a problem arises, for example, in solving hyperbolic partial differential 
equations using an adaptive method that places extra grid points in certain subregions of the 
domain (see e.g. [5] and [6]). Such an approach has also been proposed in the multigrid litera- 
ture ([2], [3], [12], [15]). At a given instant of time a typical computational mesh for either of 
these problems might look like Fig. 1.1. 


Fig. 1.1 Increased resolutions is obtained by superimposing fine grid patches over an 
underlying global coarse grid. 

Hence, any simple partitioning scheme must account for the unequal amount of work to be 
done in the left half versus the right half of the domain. If Fig. 1.1 represents the grid for a 
time dependent problem, the work load in the left and right halves is significantly different, 
since the finest grids in space typically need a smaller step in time (if an explicit finite dif- 
ferent scheme is used to integrate the solution), and so many time steps are taken on the fine 
grid for every one on any coarser grid. 

Other numerical examples which give rise to a problem with unequal work estimates 
might come from solving a pde with different equation sets in different regions. For exam- 
ple, in calculating transonic flow around airfoils, the Navier Stokes equations may be used in 
a boundary layer around the airfoil, and the Euler equations or even the potential equations 
can be used in the farfield. These different sets of equations have very different costs associ- 
ated with their corresponding difference schemes. Another possibility is that the work esti- 
mates come from different physics in different parts of the domain, for example in weather 
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calculations, depending on whether a region is over water or over land. Other examples of 
unequal work include an iteration scheme such as SOR where a certain percentage of the 
domain is relaxed a second time before iterating on the entire domain again. This ad hoc 
procedure, applied to say the 10% of the grid with the largest residual can greatly reduce the 
cost of convergence. 

In addition to the above static domains, we are interested in investigating the possibili- 
ties of solving adaptive mesh refinement problems on a multiprocessor system. A major dif- 
ficulty is that, no matter how portions of the mesh are initially assigned to processors, a 
change in the mesh refinement will ultimately cause the computational load on the processors 
to become unbalanced. Attempts at rebalancing are complicated by the need to keep the 
interprocessor communication overhead at a minimum. Since the adaptive mesh refinement 
strategy in [6] is already based on a partitioning of the domain into rectangular grid patches, 
we can derive an approach presented here which is simple and tractable. Other approaches 
are given in [9], [17]. 

Most partitioning strategies use some type of domain decomposition to balance the work 
load over many processors. Typically, these uniform mesh problems can be divided into 
boxes (Fig. 1.2a) or strips (Fig. 1.2b). 


(a) 


(b) 


Fig. 1.2 Two common partitioning strategies for rectangular mesh problems. 

The benefits of one over the other depend on the cost of transferring information around the 
perimeter of a box to the neighboring partition/processor (which depends on the machine 
architecture), and the order in which computations on such configurations can proceed. 
Papadimitriou and Ullraan [13] discuss communication/time tradeoffs for such partitionings, 
and obtain lower bounds on the rate such computations can proceed. 
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A different kind of partitioning is evident in the work of Adams and Jordan [1]. By 
partitioning down to the grid point level, using a multi-color SOR iteration scheme, simul- 
taneous updates can proceed for any given color grid point throughout the entire mesh. This 
can be useful on processor arrays as well as vector computers. 

This paper is organized as follows. We describe the binary decomposition used to parti- 
tion the work load in section 2, and discuss some of its properties in section 3. In sections 4, 

5 and 6, we study the communication costs of mapping this partitioning onto different types 
of multiprocessors: a nearest neighbor array, a tree machine, and a hypercube. We derive 
expressions for the communication versus computation costs which can be used to determine 
an optimal depth for the above partitioning. Section 7 summarizes the results. 

2. Binary Decomposition of the Domain 

In this section we describe the strategy used to partition a domain into subunits requir- 
ing equal computational effort. In this presentation we will assume that the number of avail- 
able processors is a power of 2, although many of our results generalize. Another underlying 
assumption is that the number of grid points N»p, the number of processors. We will con- 
centrate on the static case, and only say a few words about adaptively rebalancing the decom- 
position later in the section. 

Suppose that work estimates on a given domain have already been obtained, through a 
priori knowledge, or from an initial computation on a uniform mesh using a partitioning as in 
Fig. 1.2. Given these work estimates, we can now make a vertical cut through the domain so 
that the left and right segments each contain half the work (or as near as possible given the 
constraint that the line is vertical, and the number of grid points in each segment increases by 
a finite amount on shifting the location of the cut by one column). If there are four proces- 
sors available, the two segments are each partitioned using two different line segments of a 
horizontal cut line next, into a total of four equally balanced work loads. This procedure con- 
tinues by recursively partitioning using first vertical then horizontal cut line segments, so that 
the length of the longest side of any subregion is reduced every other step. A typical decom- 
position for the grids in Fig. 1.1 using 16 processors is shown schematically in Fig. 2.1. The 
idea for this decomposition was inspired by the similar looking rectangular regions used by 
Bentley [4] in answering two dimensional point domination questions. 

We emphasize that the computational work of an iteration on any rectangular region in 
Fig. 2.1 is identical. However, the communication requirements across the perimeters are 
not. In particular, if the source of the problem is a grid such as Fig. 1.1, then the grid point 
density along any given line segment varies, depending on whether a cut line intersects a 
finer grid or not. In general, therefore, we will only obtain upper bounds for communication 
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Fig. 2.1 A binary decomposition using 16 processors. 

costs assuming the worst case grid point distribution. Instead, therefore, we will now assume 
that the problem gives rise to different work requirements in different regions, but is based 
on an underlying grid of N 2 points which are uniformly distributed. 

There are several points to note about the decomposition depicted in Fig. 2.1. First, by 
restricting the subunits to be rectangular blocks, we avoid a messy problem with data struc- 
tures. If more general L-shaped regions or diagonal lines were used, the specification of a 
region would be more difficult. All that this approach requires to specify each block is the 
four corners of the rectangle (2 will do). This is sufficiently low overhead that every proces- 
sor can keep a map of the entire domain with each processor/rectangle pair. A tree data 
structure can easily be traversed for any neighbor information that is needed, for example, 
for a non-shared memory machine. 

Secondly, this approach does not suffer the drawbacks that other decompositions sug- 
gested for this problem have. For example, Fig. 2.2 indicates a grid configuration with four 
fine grids superimposed on a global coarse grid, and 2 further refined grids nested in 2 of the 
four. It is tempting to use these grids, which already form one type of decomposition of the 
domain (and are each regular with a simple data structure) as the basis for assignment of 
work to a processor. However, there is no attempt at load balancing in this approach. In 
addition, if the mesh is later changed so that there are 8 subgrids, for example, either the 
computation must request 8 processors, or some of them were idle beforehand. 

We mention that the partitioning itself is easily accomplished. In principle, the parti- 
tioning can be done by summing the numbers of grid points (or work per grid point times the 
number of grid points) first rowwise, then columnwise. The partitioning cut is then made in 
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Fig. 2.2 A domain decomposition with 4 subgrids, and 2 sub-subgrids, 
the middle. 

We point out one final advantage of the binary decomposition. As the computation 
proceeds, if it turns out that one region gets more work (say a finer mesh is introduced), a 
local rebalancing can be done without necessarily redoing the entire partitioning. For exam- 
ple, in Fig. 2.1 if region 14’s work load increases by some amount S, the last cut line in the 
partition, which separates regions 14 and 15, can be adjusted so that both regions have an 
imbalance of only 8/2. If the k previous cut lines are adjusted, the imbalance is reduced to 
8/2*. The cost of rebalancing (which might include time to send data to the new governing 
partition, for example), can be traded off against the lost time of having an imbalance in the 
work load. This can determine how high in the tree (the number k above) to rebalance. 

3. Analysis of Partitionings 

In this section we present several definitions related to our partitionings and analyze 
some of their important properties. These are essential to our discussion of mappings of 
subregions onto various processor architectures, which we present in subsequent sections. 

3.1. Definitions 

The depth of a partitioning is the number of times the domain has been partitioned. 
This equals the depth of the corresponding binary tree, with root node corresponding to the 
entire domain, and leaf nodes corresponding to each rectangle in the final partitioning. 
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Each partition line is divided into a number of segments by the incidence of other parti* 
tion lines. 

The total number of segments of a partitioning is the sum of all such segments. 

The depth of a segment is the depth of the partition line to which it belongs. 

For example, Fig. 3.1 shows a partitioning of depth 2. The depth 1 partition line a*b is 
divided into 3 segments by the incidence of the two depth 2 partition lines c-d and e*f . The 
total number of segments in this partitioning are 5. 


a 



b 


Fig. 3.1 A depth 2 partitioning with 5 segments. 

We observe that if two adjacent regions of a partitioning are assigned to different pro- 
cessors then the segment between them represents a communication requirement between the 
processors. The following definition makes this easier to appreciate. 

The graph of a partitioning is the dual graph obtained in the usual way [8] by represent- 
ing each region by a node and connecting two nodes if and only if the corresponding regions 
are adjacent (share a segment on their perimeter). Each edge in the graph of a partitioning 
represents a communication requirement between the two regions represented by the nodes at 
its end points. Fig. 3.2 shows a partitioning of depth 4 along with its graph. The assignment 
of regions of a partitioning to the processors of a multiple computer system is now equivalent 
to the mapping of the graph of a partitioning onto the graph of a multiple computer system 

m. 
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Fig. 3.2 A depth 4 partitioning and its dua] graph. 

3.2. Properties 

Clearly, the total number of segments of a partitioning is an important number in our 
analysis, since it represents the total number of different communication paths required. A 
lower bound on this number occurs when the domain has uniform computational density in 
which case the partitioning is made up of continuous horizontal and vertical lines extending 
from one side of the region to the other. It is easy to show that in this case the total number 
of segments T L is 

T L (k) = 2* +1 -2 * , k even (3.1) 

r L (k) = 2 i+1 - 3-2 2 , k odd , 
for a depth k partitioning for 2* processors. 

To obtain an upper bound T c {k) on the number of segments, we need to investigate further 
properties of partitionings. 

3.2.1. Property 1 

Note that the graph of a partitioning remains unchanged no matter how much the verti- 
cal (horizontal) partitioning lines are displaced as long as the sequence of the vertical (hor- 
izontal) coordinates of these lines are not disturbed. Because each edge of this graph 
corresponds to a segment, this property implies that the total number of segments is 
unchanged under such displacements. 
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3.2.2. Property 2 

A partitioning of depth k may be considered to be made up of the juxtaposition of two 
partitionings of depth Jt- 1. The line at which the constituent depth k- 1 partitionings abut 
each other becomes the new depth 1 line, and the depth of the remaining lines increases by 

1 l 1 l -i 

one. If Jt is even, the constituent depth k—1 partitionings will have 2 2 by 2 2 sides; in this 
case the sides with the larger number of regions will face each other. If A: is odd, the consti- 

k-i 

tuent regions have 2 2 regions along either side. 

Because of Property 1 above, we can displace the segments perpendicular to the inter- 
face of the constituent regions to any extent as long as we do not disturb their sequence. 
This allows us to distort the constituent partitions so that no two depth Jt-1 segments on 
either side of the interface are collinear, and thus obtain the maximum number of depth 1 
segments, which turns out to be 

4+x 

W*) = 2 2 -1, keven (3.2) 

*+i 

= 2 2 -1, k odd. 

It is possible for a partitioning of depth k to have the maximum possible number of 
depth 1 segments, independent of its constituent depth Jt— 1 partitionings. 

3.2.3. Property 3 

It thus follows that the depth k partitioning with maximum total segments is made up of 
two depth k— 1 partitionings with maximum total segments. This leads to the following 
recurrence for maximum total number of segments, 

—+t 

2V(Jt)=2 2 -1 + 2 7V(Jt-l), k even (3.3) 

k± 1 

7V(Jt)=2 2 -1 + 2 7V(Jt-l), k odd 
The solutions to these recurrences are 

£+2 

T C r(k) = 2* +2 - 2* - 2 2 + 1, Jt even 

ill 

T L r(k) = 2* +2 - 2* - 3*2 2 +1, Jt odd 


(3.4) 
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3.2.4. Property 4 

Another property of great interest is the maximum degree of any node in the graph of a 
partitioning. A simple counting argument shows that for 2* processors, *&4, the maximum 
degree is 

L L-i 

2 2 + 2 2 + 3, k even (3.5) 

*+i 

2 2 +3, kodd. 

The node with this maximum degree borders on the depth 1 partitioning line (see Fig. 3.3). 



Fig. 3.3 Maximum degree possible for a 2* processor decomposition, k=4. 


In fact, a simple counting procedure shows the following. For 2* processors, k even, 
the partition consists of k/2 segments of vertical cut lines interlaced with k/2 horizontal cut 
line segments. If we number the cut lines from j = 1 to k in total, then the number of 
incident edges a segment from a cut line of depth j can have is at most 


e k0) - 2 2 




— 1, Jfc even 


(3.6) 


e k (j) = 2 




- 1 , 


k odd. 


Using the fact that 7J 1 line segments make up the j ,h cut line, the expressions in (3.6) can be 
summed to provide an alternate derivation of (3.4) for the maximum number of edges in the 
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dual graph. 

4. Mapping onto Nearest Neighbor Arrays. 

In this section we investigate how the binary partitionings can be mapped onto nearest 
neighbor arrays. By nearest neighbor arrays we mean multicomputer systems in which the 
processors may be thought of as points on an integer lattice, where each processor has com- 
munication links to its 4, 8 or more nearest neighbors. For example, the Bliac machine (if 
we ignore its wraparound connections) is an example of a 4 nearest neighbor array (4nn 
array). The FEM [11] is an example of an 8nn array. 

In this analysis we will consider rectangular arrays of size 2*. For such arrays, there 
exists what we call natural mappings of depth k partitioned regions onto processors, defined 
as follows. When partitioning the domain into two regions, partition the processor array into 
two equal halves. Assign the left subdomain to the left half and the right subdomain to the 
right half. Repeat recursively until the processor partitions have exactly one processor in 
them. At this point every partitioned region has been assigned to a processor. 

4.1. Cardinality of Natural Mappings 

One way to measure the quality of a mapping is to compute its cardinality [7], defined 
as the number of edges of the problem graph that fall on edges of the processor graph 
divided by the total number of edges in the problem graph. Mappings with a cardinality of 
one have minimum interprocessor communication overhead since all processes that need to 
communicate lie on processors that are adjacent to each other. One such extreme case occurs 
when the domain being partitioned is uniform (as described in Section 3.2), and the graph of 
the partitioning matches the graph of a 4nn array perfectly. 

At the other extreme, decompositions of the type illustrated in Fig. 3.3 have graphs in 
which some nodes have exponential degree. We could not possibly acommodate the edges 
incident on such nodes using any fixed degree nearest neighbor array. The question then is 
how low the cardinality can be over all possible decompositions. 

A simple but important observation towards this goal is the following. In a natural 
mapping of the graph of a partitioning onto a 4nn array, the edges on the perimeter of any 
subdomain are all mapped on the corresponding edges of the mesh. This can be seen in Fig. 

2.1, where the 12 edges on the perimeter of the complete domain fall on the perimeter edges 
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of the array. At the same time, the 8 perimeter edges in the left and right hand subdomains 
also fall on perimeter edges of the two halves of the array and so on. 

It follows that the edges of the partitioning graph that fail to fall on edges of the array 
graph cannot exceed the number of edges that extend across partitions and are not perimeter 
edges. The number of such ‘misses’ that extend across the depth 1 partition is precisely the 
number of depth 1 segments (eq. (3.2)) less 2 (the perimeter edges), giving the following 
recurrences, 

M(k) = 2 2 — 3+2 M(k— 1), k even (4*1) 

k+l 

M{k) = 2 2 —3+2 Af(fc-l), kodd, k> 1. 


It is important to appreciate that for k— 1, M(k)=0. 

Restricting to the case of k even, these recurrences can be solved to yield 

M(k) = 3-2*" 1 - 2 T+J + 3. 


(4.2) 


Combining (4.2) with the expression for the total number of edges yields the cardinality 


C(k) = 


7V(fc)-M(fc) 

T v (k) 


= 3-2*" 1 — 2 


3*2* 



+ 1 


(4.3) 


with a similar expression for k odd. Clearly, when k = 1, C(Jt) = 1, and it drops gradually 
as k increases. It can be seen, however, that as k becomes large the cardinality converges to 
0.5 and does not drop further. For example, for *=4, C(Jfc)=.67, but for Jt=10, C(Jk)=.52. 
A similar analysis for the 8nn array gives the recurrence relation 

M(k) = 2 2 - 5 + 2 M(k- 1), which reveals that C(k) converges to .79. 

One question to consider is whether these natural mappings are near optimal, or 
whether there exists other mappings of regions to processors with higher cardinality. A 
worst-case partitioning can have T v edges, where T v is given by eq. (3.4). A 4nn array has 
only T l edges, where T L is given by eq. (3.1). An optimal mapping is one which uses all 4nn 

T -T 

edges, and thus has cardinality of at most - = 75%. In the worst case, there are 
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natural mappings that use all 4im edges, and have the maximum number of additional edges 
for a total of Tu, and so by this measure, natural mappings are within 2/3 of optimal. We 
conjecture that in fact, no other mapping can do better for a 4nn array. A similar analysis 
cannot be applied to 8nn arrays, since even in the worst case, the number of edges in these 
arrays exceeds the number of edges in the partitioning graph. An 8nn array is not well util- 
ized by our partitionings, since it has so many unused edges. 


4.2. Communication Requirements 

The cardinality expressions of the previous section are interesting, but do not give pre- 
cise expressions for the communication overhead. In this section, we obtain upper bounds 
for the total cost of communication when running our dissections on 4nn meshes. We 
assume that the solution method is as follows. All processes compute in parallel and, by con- 
struction, take the same amount of time. The communication step proceeds as follows. First 
the information to be communicated across vertical boundary segments is transmitted (first 
left then right) by each processor. Then this step is repeated for horizontal segments (top 
then bottom). Inspection of Fig. 3.2 reveals that when the dual graph of a binary dissection is 
naturally mapped onto a 4nn mesh, adjacent nodes of the dual graph are mapped onto nodes 
of the mesh that lie at least in adjacent rows or adjacent columns. This means that each com- 
munication step above has two phases. In the first phase, data makes at most one horizontal 
(vertical) movement to get to the correct column (row). In the second phase, it travels zero 
or more steps in the vertical (horizontal) direction to get to the correct row (column). The 
hardware communication mechanism at each node is assumed to be such that a unidirectional 
transmission of data can be performed on one communication link in one time step. Thus a 
two way exchange of data over a single communication line takes two time steps. 

By definition all processors have equal amounts of computation. It remains to evaluate 
how much communication overhead is incurred. Notice that in the best case of a simple uni- 
form partition, the length of the side of any square subdomain is 


L'unfformfytfl') 


N 

2*/2 ’ 


for a total communication time 


(4-4) 

k_ 

for a problem with N points on a side mapped on to a square mesh with 2 2 processors on a 
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side. In the case of non-uniform regions, the degree of distortion of a partitioning deter- 
mines the time required during the second phase of a communication step. This time is slight 
for mildly distorted partitions but can be a major factor in partitions with large distortion. 


4.2.1. Skewness of a partitioning 

To quantify the amount of distortion in a partitioning, we introduce the following con- 
cept. 


The jc-jfc?w 7 j«r(S , x ) of a given partitioning with N x N points and depth k is the ratio of 
the length of the longest horizontal side of any subdomain in that partitioning to the length of 
the side of the square in the corresponding uniform partitioning. 

The y-skewness(Sy) is similarly defined for vertical sides. 


For example in Fig. 3.3, S x is about 2.5 and S y about 4. 

We work under the assumption that there is always at least one point per processor. 

* - 

This constrains the skewness to lie between 1 and 2 2 (1 — 11 ). 


4.2.2. Dilation of edges in a dual graph. 

Skewness itself does not completely determine the communication overhead. A parti- 
tioning can be highly skewed yet have a dual graph that precisely matches a 4nn array. On 
the other hand, a partitioning can have this same skewness, but with a large mismatch 
between the graph of the partitioning and a 4nn mesh. 

To more accurately describe this mismatch we define the dilation of an edge in a dual 
graph that has been naturally mapped onto a 4nn mesh to be the number of edges that data 
passes through between two communicating processors during the second phase of communi- 
cations. 


The x-dilation, d x ( y-dilation d y ) of a partitioning is the maximum dilation over all edges 
in the x (y) direction. 

In uniform partitionings (with S x =S y = 1), the maximum dilation is zero. As skewness 
increases, the maximum possible dilation also increases. The biggest change occurs as the 

skewness increases from 1 to 2, since the maximum y-dilation increases from 0 to . The 

2 
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general expression for the worst case dilation is 



k_ 

The maximum possible y-dilation in a partitioning of depth k is 2 2 — 1. Here we assume that 
the first cut we make in our partitioning is always a vertical line. However it is still possible 
for a partitioning with very high skewness to have zero dilation. 

4.2.3. Data Transmitted Per Step 

The x and y dilations and skewnesses allow us to compute the time required for com- 
munication. Each data point can be transmitted to its destination row or column in one time 
step in phase 1 of the communication. Assume the hardware first takes care of all data that 
is to move in the x direction and then all data to be moved in the y direction. For a non- 
uniform partitioning, the maximum number of data points transmitted per communication 
step is 2S x L un \fo r m(h>N) in the x direction and 2S y L u „if onH (kJ l f) in the y direction, since each 
region has two vertical and two horizontal sides. The time required for phase one is then no 
more than 

Phase 2 of communication is complicated by the fact that there may be several overlap- 
ping communication paths in a single row or column. For example if the first processor in a 
row is transmitting to the 4th, then the 2nd might be transmitting to the 5th at the same time, 
causing congestion. Furthermore, the total time for communication is influenced by the x 
and y-dilations. 

4.2.4. Communication Strategies 

We propose two communication strategies for this troublesome phase two of communi- 
cation. These are the permutation strategy and the pipelined strategy. Both strategies are use- 
ful over the range of values of problem size N, depth of partitioning k and skewness S. 
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4.2.5. The permutation strategy. 

We may view phase 2 of communication as the permutation of a set of data on a chain 
of processors. Hach processor sends a data value to a processor at most d x (d v ) processors 
away in the x (y) directions. All processors can send one data value out to its destination in 
2*d x or 2 *d y time steps. The constant 2 arises because it takes 2 time units for a processor to 
receive and transmit one data point. The constant is not 4, which it would be for arbitrary 
permutations, because planarity insures no processor both transmits and receives in the same 
direction. The time for phase 2 of communications is thus 

Tperm = 2 * ( Max points per side) * ( Max dilation ). 
for each side. This works out to be 

r p . m /k,N)‘ 2 • S, 

Tptmjd'N)- 2 * Sy -~p$*dy 

The total time for phase 2 of communications is the sum of the two expressions multiplied by 
2, since each region has two vertical and two horizontal sides, 

Tllptrm = HS x *dg + Sy*d y ) (4.5) 


4.2.6. The pipelining strategy 

Instead of viewing phase 2 as a sequence of permutations, we can think of it as a 
sequence of data transfers in which each processor transmits all of its data points to all pro- 
cessors to which it needs to transmit in a pipelined fashion. That is, if processor 3 needs to 
send data to processors 5, 6 and 7, it pipelines this transfer so that as soon as it finishes send- 
ing off data intended for processor 7, it starts sending data for processor 6 etc. In this case it 
is impossible for, say, processor 5 to send data to processor 8 (should it need to do so) until 
processor 3 has finished. This situation can arise from the partitioning in Fig. 4.1a, where 
the configuration gives rise to two separate, overlapping chains of communication. A chain is 
a contiguous sequence of processors in a column (row) that all receive data from a single pro- 
cessor in an adjacent column (row). Fig. 4.1b shows the chains from the col umn in Fig. 4.1a. 

Thus there exists the problem of congestion which we define to be the number of over- 
lapping chains of communication in a given row or column. As might be expected, 
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The total time for phase 2 of communications is again the sum of the two expressions multi- 
plied by 2, giving 


W - + •O'e.w + 4 (s,^r + <)-Gv(*). <4.7) 

Comparison of (4.5) and (4.7) shows that the permutation strategy is always preferable 
for partitions with low skewness, when 5s 2. For 5> 2 and low to moderate depth of parti- 
tioning, pipelining is better, since in this case the congestion G is small. Fig. 4.2 shows 
graphically the ratio of the respective costs of these communication strategies, for a problem 
where N = 4096 is the number of points on a side. 


□ < 0.25 
ED 0.25 - 0.50 
■ 0.51 - 1.00 
SB 1.01 - 2.00 
S 8 S 2.01 - 4.00 

m > 4.00 


4 6 8 10 12 14 16 18 20 22 
depth of partitioning, k ► 

Fig. 4.2 Ratio of cost of pipeline to permutation strategy for communicating processors. 



5. Mapping onto Trees 

In this section we consider the costs of mapping the partitions onto binary trees. At 
first sight, tree structured multiprocessors would appear unsuitable for grid problems, 
because of potential traffic bottlenecks at the root. However, they are natural to consider in 
this case since we use a binary decomposition of the domain. Our results show that in the 
worst case, with 2 k processors, the performance of binary trees is within a factor of a con- 
stant times k of the mesh performance. 
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In our model, the leaf nodes do all the computation, and the rest of the nodes are used 
only for communication. Fig. 5.1 indicates how the partitions are matched with the leaf 
nodes. Regions that are separated by the last depth k partitioning cut are mapped to adjacent 
leaf nodes. Regions separated by the first partitioning cut are in different halves of the tree. 

The solution algorithm starts with all leaf nodes computing on their respective sub- 
domains. The communication step can be thought of as having k phases. In the first phase, 
the leaf nodes send up all data that must pass through the root node. This indudes all nodes 
that border the first, depth 1 partition cut, and takes time proportional to the length of that 
line. In the second phase, it sends data that rises no higher than the two children of the root 
node. This communication is between nodes sharing one of the two depth 2 boundary seg- 
ments, and takes time proportional to the length of the depth 2 segment. In the k ,h and last 
phase, leaf nodes sharing a depth k boundary segment swap data. If each phase proceeds to 
completion before the next phase starts, the communication time for phase j is proportional 
to the length of the maximum depth j segment. The total communication time is then propor- 
tional to the sum of these, and has latency k 2 through the tree. Instead, a leaf node can start 
the next phase of communication as soon as it is ready. This pipelining gives both a smaller 
communication time and a smaller latency through the tree of 2fc-l. We define a hyperper- 
imeter, (in analogy with the perimeter estimates for the meshes), as ff(*) = 2 where 

J - 1 

lj = any one line segment of depth j . Fig. 5.2 shows the maximum length hyperperimeter 
in the given partitioning. If the communication is pipelined, the hyperperimeter is a con- 
nected series of line segments. If instead, each communication phase proceeds to completion 
before the next phase starts, the maximum hyperperimeter need not be connected. Instead of 
the maximum of the sums, we would take the sum of the maximum length segments for each 
depth j. 

To have a fair comparison with the performance of meshes, we assume that each non- 
leaf node can only receive or transmit in one direction over one link in one unit of time. 
Thus, a node can receive 3 values on its input lines and transmit 3 values on its output lines 
in 6 units of time. In the worst case therefore, buffers of size no greater than 2N are 
required. The total communication cost is then 6 hyperperimeter + latency. For example, 
for a uniform domain decomposition the maximum hyperperimeter is 

N+ J + J + • • ’ +_ ^ +_ r7 + "T = tf[3 — t*]. Thus the total time 
2 2 1 2 2 ’ 1 2 r 2 T 
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Fig. 5.1 A binary decomposition mapped onto a tree of processors. 

T tt ntfor,n(W) = 18 N (1-2~ T ) + 2k— 1. (5.1) 

Clearly, this is much worse than the corresponding expression (4.4) for a mesh. 



Fig. 5.2 Darkened hyperperimeter shows the maximum communication requirement in 


the tree. 
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For a non-uniform partition, in the worst case the hyperperimeter can grow like 
N+ (N — 1)+ (N — 2) • • • , except there must be at least 1 point per region. For k = 4, for 
example, the hyperperimeter is N+(N-2)+(N— 2)+(N— 3). In general, the expression is 

H(k) = kN - 2 2 (k- 3) - 3 for a total time of 

£ 

T(k,N) = 6Jb» - 3-2 2 (Ar — 3) — 9-t-(2Jk — 1) (5.2) 

Notice that the latency is essentially irrelevant. In comparison with eq. (4.7) for nearest 
neighbor meshes, where the leading term is 4A/, the trees are a factor — worse. However, a 

naive analysis of trees gives a worst case bound of 6k?N, which is avoided here by using the 
extremely ordered properties of the binary decomposition. 

These communication estimates can be used to determine the optimal number of parti- 
tions to use to solve a given sized problem, for a given efficiency. We do a sample calcula- 
tion for the uniformly partitioned case. For an N by N square grid, the amount of computa- 
tion to be done using 1 processor is W : = C^N 2 . For a rather simple method, C : might 
include 20 floating point operations per point. For 2* processors, the amount of work per 
processor is 

W 2t = CyNW + C 2 -18N, 

where we have dropped the lower order terms in the communication cost. The speedup is 



and the efficiency is 

E = S_ „ 1 

2 * i . £l 18 2* ‘ 

C x N 

For example, if the communication time for one item C 2 takes approximately the same time 
as one floating point operation, then if N = 10 2 , and Jfc = 6, the efficiency is only 63%. For 
an efficiency of 87%, k — 4, or 16 processors should be used. 
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6. Communication Cost Analysis for Hypercubes. 

An analysis similar to the one for nearest neighbor arrays can be performed for hyper- 

*. L 

cubes. A nearest neighbor array of size 2 2 by 2 2 can be easily embedded in a hypercube of 
dimension k using the well known Gray code mapping method. Fig. 6.1 shows a 16 x 16 pro- 
cessor array that has been mapped onto a hypercube of dimension 8. All the edges of the 
mesh but only some edges of the hypercube have been shown. Specifically, only the edges of 
the hypercube that connect nodes that lie along one row and column of the 4nn mesh are 
shown. These edges demonstrate the richer interconnection of the hypercube in relation to 
the mesh. We would expect this richer interconnection to reduce the communication over- 
head when running our dissections. 



Fig. 6.1 A 16 by 16 mesh mapped onto a hypercube of dimension 8. Only some hyper- 
cube edges are shown. 
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6.1. Mapping onto Hypercubes. 

To map the dual graph of a partitioning onto a hypercube, first map the dual graph onto 
a mesh, and then use the Gray codes to embed the mesh in the hypercube. The cardinality of 
the mapping in this case is no better than that obtained when a 4nn mesh is used. However, 
the richer interconnection of the hypercube leads to a smaller communication overhead when 
a detailed analysis is done. 

6.2. Communication strategies. 

As before, we assume a processor can send or receive only one item on one link at a 
time. We believe this is an accurate model of a scalable multiprocessor. A hypercube that 
can transmit on all links at one time is realizable only for a fixed dimension, but cannot be 
extended to higher dimensions. We again use two different communication strategies. In the 
case of the permutation strategy we take advantage of the logarithmic time to permute data 
in a hypercube. For the pipelining strategy we exploit the logarithmic diameter of the hyper- 
cube. 


6.2.1. Permutation Strategy. 

A hypercube of dimension k can perform any permutation of 2 k elements, one per pro- 
cessor, in 4k- 1 time, under our communication assumptions [16]. We can do better than this 
for our problem by noting that the edges connecting all the nodes in a single row or column 
of a mesh embedded in a hypercube (see Fig. 6.1) form a sub-hypercube of dimension k/2. 
Recall that data flows only along columns or rows during phase 2 of the communication step. 
Thus each permutation step takes at most 2it — 1 time. This is a worst case analysis, how- 
ever, and may not be optimal, since it does not depend on the dilation of a given partitioning. 
We obtain the following expressions, 

- 1 ) 

— 1 ) 

for a total phase n time of 

T n,<rm = 2*(2k-l)*(s x + S } ) (6.1) 


These correspond to (4.5) for meshes. 
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6.2.2. Pipelining Strategy 

In the case of hypercubes, the pipelining strategy utilizes the logarithmic diameter of 
hypercubes to improve communication times. If we examine a contiguous subchain of d x 
nodes in a row or column, it is easy to verify that there exists a tree of diameter no more 
than 1+ Jlog 2 (J r )j rooted at every node, that reaches every other node in the chain.. This 

tree may be found by doing a breadth first search outwards from the desired node, staying 
within the graph induced by the contiguous subchain. Thus the requirement of the pipelining 
strategy that one node transmits to several other nodes in a subchain takes no more than 
2+ J^21og(d x )| time. The problem of congestion remains. The amount of congestion is pre- 
cisely the same as with 4nn meshes. This is because the problem and not the multicomputer 
architecture determines the congestion. In the hypercube, a number of trees are pumping 
data out towards their leaves in parallel, instead of a number of chains linearly pumping data 
out towards their end points, as was the case for meshes. The time required is 

T rlp ,AW)-(S,-^ + 2I«j 2 («),)+2) * GJk) 

T, lp .AW=G r $; + 2lo Sl (d y )+2) • Gfi) 

for a total time of 

Tntipe = 2 + 21og(d,)+2)G x + 2-(S y -^ + 2-log(d > .) + 2)G r (6.2) 

Fig. 6.2 shows the ratio of the pipelining to permutation communication costs in a 
hypercube. For any fixed skewness, the pipelining strategy is initially cheaper, but as with 
meshes, the permutation strategy eventually wins with increasing depth of partitioning. 

We compare the communication costs in a nearest neighbor array versus a hypercube in 
Fig. 6.3. For each type of machine, the cost used is the cheaper of the permutation and pipe- 
lining costs, for the given parameters. For a large range of skewness and low depth of parti- 
tioning, the mesh is not much worse than a hypercube. For high depth of partitioning and 
moderate skewness, the hypercube is much better, but its performance approaches that of a 
mesh for very low and very high skewness. For large skewness, there is a region with a 
large perimeter which takes a long time to transmit. The cost of this dominates both the 
square root and logarithmic communication latencies of the mesh and hypercube respectively. 
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Fig. 6.2 Ratio of cost of pipeline to permutation strategy for a hypercube. 


7. Conclusions 

We have shown how domains with non-uniform workloads can be partitioned to 
equidistribute the computational load. For these types of grid problems, our binary decom- 
positions can be mapped in a natural way onto trees, nearest neighbor meshes and hyper- 
cubes. 

For 4nn arrays, it is interesting that the map from the problem graph onto the array 
graph has at least a 50% hit rate of edges. For an 8nn array this is 79%. 

We further evaluated the performance by analyzing the traffic through these networks. 
For an N by N problem using 2* processors in the worst cases, the communication overhead 
using the pipelined strategy, was approximately 4N for hypercubes, SN for meshes, and 6kN 
for trees. The performance of trees was found to be better than a naive analysis would sug- 
gest. While these results are encouraging, certainly a better approach is to partition using a 
weighted sum of computational effort and communication costs. In addition, the more diffi- 
cult problem of adaptive load balancing will have to confront the problems of modifying data 
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Fig. 6.3 Ratio of the communication cost of a nearest neighbor array and a hypercube. 


structures within each processor. It will clearly be more efficient to tolerate small amounts of 
load imbalance than to change partitions with every perturbation. 

We believe the decomposition technique presented here would be beneficial even on 
shared memory machines such as the Ultracomputer [10] or the IBM RP3 [14]. Our parti- 
tionings allow efficient load balancing across processors, without the overhead of a fine- 
grained queueing mechanism that would otherwise be necessary. They would also reduce 
memory traffic and increase the cache hit rate. 
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